Introduction to R and RStudio

Introduction

General information

R is a programming language that allows you to:

  1. manipulate data: import, transform, export, etc.
  2. carry out more or less complex statistical analyses: description, exploration, modelling…
  3. create (pretty) figures

Features:

  • available on Windows, Mac and Linux.
  • free and open source
  • large user community/online help.
  • large number of specific packages.

History:

  • 1993: start of the R project
  • 2000: release of R 1.0.0
  • 2024: R 4.3.3

Fun fact (not so fun) about why R is better than Excel:
In 2020, “Covid: the United Kingdom misses thousands of cases because of… an Excel file that reached its size limit”
More information comparing R and Excel here.

The RGui

After installing R, you can use it by double-clicking on the R icon.

RGui

As you can see, the default R interface (RGui console) is not very user friendly. So it is better to use complementary software that will function as a graphical interface between you and R. This interface is a kind of shell that makes R work in the background. Several graphical interfaces have been developed, but the most used and practical is RStudio.

RStudio


First sight

RStudio displays four large panes. Their positions can be changed to suit your preferences. Here are the defaults:

Pane names and positions
         left            right
upper    Script pane     Environment/History pane
lower    Console pane    Help/Plots/Files pane

Info: your four panes may be blank while two of mine are filled with text. We’ll come back to that later.

The console pane (lower left)

This is a simple R console, like in the RGui (see previous section).

Warning: This is an example provided by the bioinformatics platform BiGR. Your local RStudio might differ: the version of R, the list of available packages, etc. On your local machine, the RStudio console will match the RGui.

Let’s try to enter the command print():

print("Hello World")
[1] "Hello World"

We just used a function, called print. This function tries to print on screen everything provided between the parentheses ( and ). In this particular case, we gave it the character string "Hello World", and the function print successfully printed it on screen!

Now click on Session -> Save Workspace As and save the current workspace (I named mine “my_session.RData”). What happened in the R console pane? You saw it! A command has been written automatically. For me, it is:

save.image("my_session.RData")

When you need help with R, whether on a function error, a script result or anything similar, please save your workspace and send it to your favorite R developer. It contains every object created in your session.
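To see what a saved workspace holds, here is a minimal sketch that saves the workspace to a temporary file and restores it with load(). The variable demo_value and the temporary file are illustrative assumptions, not part of the course material, and the sketch assumes the code is run at the top level of a session.

```r
# Create an object, save the whole workspace, then remove the object
demo_value <- 42
workspace_file <- tempfile(fileext = ".RData")  # temporary path, for illustration only
save.image(workspace_file)
rm(demo_value)

# load() restores every object stored in the file
load(workspace_file)
print(demo_value)
```

This is what your R developer will do with the file you send: a single load() call brings back your objects.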

Info: The console offers syntax highlighting, good auto-completion and parameter suggestions. All bioinformaticians use auto-completion, but shhh, it’s our secret.

The environment/history pane (upper right)

This pane has three main tabs: Environment, History and Connections (the tab Tutorial was added recently).


Environment

Environment lists every single variable, object or dataset loaded in R. This includes only what you typed yourself and does not include environment variables. Example: in your console pane, enter the following command:

my_var <- 0  # May also be written my_var = 0

What happened in the Environment pane? You’re right: a variable is now available!

When a more complex object is declared in your work space, then some general information may be available. Example:

small_table <- data.frame("col_a"=c(1, 3), "col_b"=c(2, 4))

You can see the data frame. Click on it to get a preview of the data it contains, then click on the light-blue arrow to get a deeper insight into its content.

Now click on Session -> Clear Workspace and see your work disappear. This action cannot be undone. While it is useful to clear your workspace from time to time in order to avoid name collisions, it is better to save your workspace first.
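The same housekeeping can be done from the console. The sketch below uses the base functions ls(), exists() and rm(); the variable names are made up for the illustration.

```r
# Create two variables and list what the workspace contains
first_value <- 1
second_value <- "two"
print(ls())

# Remove a single variable by name
rm(first_value)
print(exists("first_value"))   # FALSE: the variable is gone
print(exists("second_value"))  # TRUE: this one is untouched

# rm(list = ls()) would remove every object at once,
# which is what clearing the workspace from the menu does
```

Removing one variable at a time is a gentler option when only a few names are in the way.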

History


This tab is quite important: while you test and search in the console, your history keeps track of each command line you entered. This will definitely help you build your scripts, pass your command lines on to your coworkers, and recover from possible unfortunate errors.

Each history is related to a session. You may see many commands in your history. Some of them are not even listed in your console. RStudio writes every command there, even the ones that were hidden for the sake of your eyes (knitting commands, display commands, help commands, etc.).

Please note that your history has a limit and only saves the latest command lines.

The help/plots/files pane (lower right)

This pane has four main tabs: Files, Plots, Packages and Help (the tab Viewer is not really used, and Presentation was added recently).


Help

This is perhaps the most important tab of RStudio. THIS is the difference between RStudio and other code editors. Search for any function here rather than on the internet. This tab shows you the help available for YOUR version of R and YOUR version of a given package.

Different versions might have both different default parameters and different interfaces. When you do search the internet, make sure that the commands you copy and type are not harmful to your computer.

Never ever copy code from the internet straight into your console. Why? Example: https://www.wizer-training.com/blog/copy-paste
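The local help can also be queried from the console. As a small sketch, the base functions args() and formals() show the arguments of a function as installed on your machine; read.csv() is used here only as an example, and the variable argument_names is an illustrative name.

```r
# Interactive equivalents of a search in the Help tab (run them in the console):
# help("read.csv")
# ?read.csv

# Show the arguments of a function and their default values
args(read.csv)

# Check programmatically that an argument exists in YOUR version
argument_names <- names(formals(read.csv))
print(argument_names)
print("header" %in% argument_names)
```

This is a quick way to confirm that a parameter found in an online example really exists in your installation before using it.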

Files

Just like in any file explorer, you can move across directories, create folders and files, delete them, etc.

You can create a folder from this tab, or use the function dir.create():

dir.create("Intro_R")

You should change your working directory right now, either through the RStudio menus or with the function setwd():

setwd("Intro_R")

You can delete files from this tab, or use the function file.remove():

file.remove("annotation.csv")
file.remove("expression.txt")
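The file functions above can be tried safely in a temporary folder. This is a minimal sketch: the folder and file names are hypothetical, and tempdir(), file.path(), file.create() and file.exists() are base R functions that the course does not cover.

```r
# Work inside a temporary folder so that no real file is touched
sandbox <- file.path(tempdir(), "Intro_R_demo")  # hypothetical folder name
dir.create(sandbox, showWarnings = FALSE)

# Create an empty file, then check that it exists
demo_file <- file.path(sandbox, "annotation_demo.csv")  # hypothetical file name
file.create(demo_file)
print(file.exists(demo_file))

# Remove the file only if it is really there
if (file.exists(demo_file)) {
  file.remove(demo_file)
}
print(file.exists(demo_file))

# getwd() shows the current working directory without changing it
print(getwd())
```

Checking with file.exists() first avoids the warning that file.remove() gives when the file is missing.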

Packages

All installed packages are listed here, with their description and their version.

More information about packages is given in a later section.

Plots

If you work with R scripts, the graphs will be displayed here.

The script pane (upper left)


This is where you write your R scripts. This also accepts other languages (e.g. bash, python, …), but RStudio shines for its R integration.

Please, please! Write your commands in the Script pane, then execute them by hitting CTRL + Enter. This is very much like your lab notebook: the history panel only keeps a limited number of commands in memory, while the script keeps your commands in a file on your disk. You may share it, edit it, comment it, etc.

The extension for an R script is .R or .r (or .Rmd for the special R Markdown scripts), for example my_script.R.
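A saved script can also be executed in one go with the base function source(). The sketch below writes a tiny script to a temporary file as a stand-in for my_script.R; its content and the variable script_result are invented for the example.

```r
# Write a two-line script to a temporary file (stand-in for my_script.R)
script_path <- tempfile(fileext = ".R")
writeLines(c(
  "# Compute a sum and store it",
  "script_result <- 1 + 2"
), script_path)

# source() executes every line of the script, in order
source(script_path)
print(script_result)  # 3
```

This is how a coworker can replay your whole analysis from the file you shared.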


TLDR – Too Long Didn’t Read

Graphical interface overview:

  1. Write command lines in the Script pane (upper left).
  2. Execute command lines by hitting CTRL + Enter from the Script pane and see them in the console.
  3. Have a look at the environment and history, if needed, in the upper right pane.
  4. Search for help in the lower right pane.

R – Basics

Variables and types

Numbers

Remember, a variable is the name given to a value stored in memory. For example, 3, the number three, exists in R. You can store it in a variable with the arrow operator <-:

three <- 3

With the code above, the number 3 is stored in a variable called “three”. You can do this in R with anything. Literally anything. Whole files, pipelines, images, anything.

Maths in R works the same as your regular calculator:

3 + three # Add
[1] 6
1 - 2 # Subtract
[1] -1
4 / 2 # Divide
[1] 2
3 * 4 # Multiply
[1] 12
7 %/% 2 # Floor division
[1] 3

Info: # is the way to write a comment into your script. The instruction after #, and on the same line, will not be executed by R. For example in 3 + three # Add, 3 + three will be executed, but # Add will not. A good coder uses comments a lot to explain the calculations made.
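A few more operators complete the calculator. This short sketch shows the modulo and power operators and the usual priority rules; the variable names are chosen for the illustration.

```r
remainder <- 7 %% 2              # modulo: what is left after dividing 7 by 2
power <- 2 ^ 3                   # exponent: 2 * 2 * 2
with_priority <- 1 + 2 * 3       # multiplication is evaluated before addition
with_parentheses <- (1 + 2) * 3  # parentheses are evaluated first

print(remainder)         # 1
print(power)             # 8
print(with_priority)     # 7
print(with_parentheses)  # 9
```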

Characters

Characters are delimited with quotes, either double " or single ':

four <- "4"
five <- '5'

# The example below is a very good example of
# how to never ever name a variable.
<- "happy"

Maths does not work on characters at all. Try the following (both lines raise an error):

"4" + 1
four + 1
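To read the error without stopping a script, here is a minimal sketch using tryCatch() and conditionMessage(), two base R functions that are not covered in this course. The variable error_message is an illustrative name, and the exact wording of the message depends on the language of your R session.

```r
four <- "4"

# tryCatch() intercepts the error so that the script can continue
error_message <- tryCatch(
  four + 1,
  error = function(e) conditionMessage(e)
)

# The message explains that one of the values is not a number
print(error_message)
```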

You can try to turn characters into numbers with the function as.numeric():

as.numeric("4") + 1
[1] 5
as.numeric(four) + 1
[1] 5
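The conversion only works when the text looks like a number. As a small sketch, here is what happens otherwise: R returns NA (a missing value) together with a warning, which suppressWarnings() hides here. The variable not_a_number is an illustrative name.

```r
# A string that does not look like a number cannot be converted
not_a_number <- suppressWarnings(as.numeric("four"))
print(not_a_number)         # NA
print(is.na(not_a_number))  # TRUE

# Decimal numbers written with a dot are converted as expected
print(as.numeric("4.5") + 1)  # 5.5
```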

A function is an R command that is followed by parentheses ( and ). Between these parentheses, we enter arguments. Use the Help pane to get information about the list of arguments expected and/or understood by a given function.

As said previously, you can store any of the previously typed commands in a variable:

five <- as.numeric("4") + 1
two <- 1 + (0.5 * 2)
print(five)
[1] 5
print(two)
[1] 2

Please! Please! Give your variables names that humans can understand. I don’t want to see any of you calling your variables “a”, “b”, “my_awsome_var”, …

Tricky Question:

I have two numbers: mysterious_number_7 and suspicious_number_7. When I apply the function print() to them, it returns 7. They are both numeric. However, they are not equal… Why?

# Show the value of the variable mysterious_number_7
print(mysterious_number_7)
[1] 7
# Show the value of the number suspicious_number_7
print(suspicious_number_7)
[1] 7
# Check that mysterious_number_7 is a number
is.numeric(mysterious_number_7)
[1] TRUE
# Check that suspicious_number_7 is a number
is.numeric(suspicious_number_7)
[1] TRUE
# Check that values of mysterious_number_7 and suspicious_number_7 are equal
mysterious_number_7 == suspicious_number_7
[1] FALSE
# Check that values of mysterious_number_7 and suspicious_number_7 are identical
identical(mysterious_number_7, suspicious_number_7)
[1] FALSE

We will talk about the difference between equality and identity later.

Answer

This is due to the number of digits displayed by R. You are very likely to run into this issue in the future, like every (bio)informatician around the world.

mysterious_number_7 <- 7.0000001
suspicious_number_7 <- 7
print(mysterious_number_7)
[1] 7
print(suspicious_number_7)
[1] 7
mysterious_number_7 == suspicious_number_7
[1] FALSE
identical(mysterious_number_7, suspicious_number_7)
[1] FALSE

You can change the number of displayed digits with the function options(), for example options(digits=10). R only accepts values up to 22 for this option.
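Two related tools are sketched below: the digits argument of print(), which changes the display for one call only, and all.equal(), a base function that compares numbers up to a tolerance. The tolerance of 1e-6 is an arbitrary choice made for this illustration.

```r
mysterious_number_7 <- 7.0000001
suspicious_number_7 <- 7

# Show more significant digits for this call only
print(mysterious_number_7, digits = 10)

# Strict comparison: the two numbers differ
print(mysterious_number_7 == suspicious_number_7)  # FALSE

# Comparison up to a chosen tolerance (1e-6 here, an arbitrary choice)
nearly_equal <- isTRUE(all.equal(mysterious_number_7, suspicious_number_7,
                                 tolerance = 1e-6))
print(nearly_equal)  # TRUE
```

Comparing with a tolerance is often what you want for the results of calculations, where tiny rounding differences are expected.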

Boolean

Aside from characters and numeric, there is another very important type in R (and computer science in general): booleans. There are two booleans: TRUE and FALSE.

3 > 4
[1] FALSE
5 < 10
[1] TRUE
5 == 10
[1] FALSE
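Booleans can be stored in variables and combined. This short sketch uses the logical operators & (and), | (or) and ! (not); the variable names are illustrative.

```r
is_small <- 3 < 4    # TRUE
is_large <- 3 > 100  # FALSE

print(is_small & is_large)  # AND: FALSE, because one side is FALSE
print(is_small | is_large)  # OR: TRUE, because at least one side is TRUE
print(!is_small)            # NOT: FALSE, the opposite of TRUE
```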

Data structures

Until now, we have seen a single piece of information stored in a variable. But we can create more complex structures in order to store several pieces of information in a single variable.

Vector

You can make vectors and tables in R. Don’t panic, there will be no maths in this presentation.

In R, vectors are created with the function c():

one2three <- c("1", "2", "3", "4", "10", "20")
print(one2three)
[1] "1"  "2"  "3"  "4"  "10" "20"
is.vector(one2three)
[1] TRUE

One can select an element of the vector with square brackets [ and ]:

one2three[1] #select the first element
[1] "1"

One can select multiple elements of a vector with the operator : (colon):

one2three[2:4] #select from second to fourth element
[1] "2" "3" "4"
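A few other selections are worth knowing. This sketch reuses the vector one2three from above and shows length(), selection by a vector of positions, and negative positions, which exclude elements.

```r
one2three <- c("1", "2", "3", "4", "10", "20")

print(length(one2three))             # number of elements: 6
print(one2three[c(1, 5)])            # first and fifth elements: "1" "10"
print(one2three[-1])                 # everything except the first element
print(one2three[length(one2three)])  # last element: "20"
```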

Question 1: Is there a difference between these two vectors ?

c_vector <- c("1", "2", "3")
n_vector <- c( 1,   2,   3 )
Answer

There is a difference indeed: c_vector contains characters, n_vector contains numbers.

print(c_vector)
[1] "1" "2" "3"
print(n_vector)
[1] 1 2 3
print(is.numeric(c_vector))
[1] FALSE
print(is.numeric(n_vector))
[1] TRUE
identical(c_vector, n_vector)
[1] FALSE

The function identical() is a strict test: it returns TRUE only if both objects have the same type and the same values.

You may have learned about the operator == for equality. But this is not perfect, look at our example:

c_vector == n_vector
[1] TRUE TRUE TRUE

The operator == is not aware of types: it converts both sides to a common type before comparing them.

Another example, mixing numerics and booleans:

1 == TRUE
[1] TRUE
identical(1, TRUE)
[1] FALSE

In computer science, there is a reason why booleans and integers can be mixed. We won’t cover this reason now, it’s out of our scope. Feel free to ask if you’re interested in history and maths.
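In practice, this mix is handy: TRUE counts as 1 and FALSE as 0. The sketch below uses a made-up vector passed_filter to count and average booleans.

```r
print(as.numeric(TRUE))   # 1
print(as.numeric(FALSE))  # 0

# Hypothetical result of a filter applied to four genes
passed_filter <- c(TRUE, FALSE, TRUE, TRUE)
print(sum(passed_filter))   # number of TRUE values: 3
print(mean(passed_filter))  # proportion of TRUE values: 0.75
```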

Question 2: Can I include both text and numbers in a vector ?

mixed_vector <- c(1, "2", 3)
Answer

No. We cannot mix types in a vector. Either all its content is made of numbers, or all its content is made of characters.

Here, all our values have been turned into characters:

print(mixed_vector)
[1] "1" "2" "3"
print(is.numeric(mixed_vector))
[1] FALSE
print(is.character(mixed_vector))
[1] TRUE
print(all(is.numeric((mixed_vector))))
[1] FALSE
print(all(is.character((mixed_vector))))
[1] TRUE

Above, the function all() returns TRUE if all the values it receives are TRUE.
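The base function class() shows which type R chose when several types are given to c(). As a short sketch: booleans turn into numbers when mixed with numbers, and everything turns into characters as soon as one character is present.

```r
print(class(c(1, 2, 3)))    # "numeric"
print(class(c(1, "2", 3)))  # "character"
print(class(c(TRUE, 2)))    # "numeric": TRUE became 1
print(class(c(TRUE, "a")))  # "character": TRUE became "TRUE"
```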

Question 3: How to create a histogram from a vector?

Help: A simple way to visualize your data is to use a graph. The function hist() may help you (of course, use the Help pane!).
Answer
hist(c_vector)

Error in hist.default(c_vector) : ‘x’ must be numeric

Why is this command not working? The error says: “‘x’ must be numeric”. The function only accepts vectors made of numeric values.

hist(n_vector) # worked perfectly !
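If your values arrived as characters, one possible fix is to convert them before plotting. The sketch below assumes every value really is a number written as text; it uses plot = FALSE, an argument of hist() that returns the computed counts without drawing, so that the result can be inspected. The variable names are illustrative.

```r
c_vector <- c("1", "2", "3")

# Convert the characters into numbers first
numeric_values <- as.numeric(c_vector)

# plot = FALSE returns the histogram data instead of drawing it
histogram_data <- hist(numeric_values, plot = FALSE)
print(histogram_data$counts)

# In RStudio, this draws the histogram in the Plots tab:
# hist(numeric_values)
```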

Data Frame

In R, tables are created with the function data.frame():

one2three4 <- data.frame(c(1, 3), c(2, 4))
print(one2three4)
  c.1..3. c.2..4.
1       1       2
2       3       4

By default, R gives names for columns and rows.
You can rename columns and row names respectively with functions colnames() and rownames().

colnames(one2three4) <- c("Col_1_3", "Col_2_4")
rownames(one2three4) <- c("Row_1_2", "Row_3_4")
print(one2three4)
        Col_1_3 Col_2_4
Row_1_2       1       2
Row_3_4       3       4

You can access a column or a row of the data frame using square brackets [ and ]. Use the following syntax: [row, column]. Use either the name of the row/column or its position.

# Select a row by its name
print(one2three4["Row_1_2", ])
        Col_1_3 Col_2_4
Row_1_2       1       2
# Select a row by its index
print(one2three4[1, ])
        Col_1_3 Col_2_4
Row_1_2       1       2
# Select a column by its name
print(one2three4[, "Col_1_3"])
[1] 1 3
# Select a column by its index
print(one2three4[, 1])
[1] 1 3
# Select a cell in the table
print(one2three4["Row_1_2", "Col_1_3"])
[1] 1
# Select the first two rows and the first column in the table
print(one2three4[1:2, 1]) 
[1] 1 3
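Columns can also be reached with the $ operator, which is very common in R code. This sketch rebuilds the table above and also shows drop = FALSE, an option of the square brackets that keeps the result as a data frame instead of a vector.

```r
one2three4 <- data.frame("Col_1_3" = c(1, 3), "Col_2_4" = c(2, 4))
rownames(one2three4) <- c("Row_1_2", "Row_3_4")

# Select a column by its name with $: same result as one2three4[, "Col_1_3"]
print(one2three4$Col_1_3)

# Keep the selected column as a one-column data frame
single_column <- one2three4[, "Col_1_3", drop = FALSE]
print(single_column)
```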

If you like maths, you will remember the order [row, column]. If you’re not familiar with that, then you will do like 99% of all software engineers: you will write [column, row], and you will get an error. Trust me. 99%. Remember, an error is never a problem in computing.

Question 1: Can I mix characters and numbers in a data frame row ?

Answer

Yes, it is possible:

mixed_data_frame <- data.frame(
  "Character_Column" = c("a", "b", "c"),
  "Number_Column" = c(4, 5, 6)
)
print(mixed_data_frame)
  Character_Column Number_Column
1                a             4
2                b             5
3                c             6

The function str() can be used to look at the type of each element in an object.

str(mixed_data_frame)
'data.frame':   3 obs. of  2 variables:
 $ Character_Column: chr  "a" "b" "c"
 $ Number_Column   : num  4 5 6
str(one2three4)
'data.frame':   2 obs. of  2 variables:
 $ Col_1_3: num  1 3
 $ Col_2_4: num  2 4

Question 2: Can I mix characters and numbers in a data frame column ?

Answer

No:

mixed_data_frame <- data.frame(
  "Mixed_letters" = c(1, "b", "c"),
  "Mixed_numbers" = c(4, "5", 6)
)
print(mixed_data_frame)
  Mixed_letters Mixed_numbers
1             1             4
2             b             5
3             c             6
str(mixed_data_frame)
'data.frame':   3 obs. of  2 variables:
 $ Mixed_letters: chr  "1" "b" "c"
 $ Mixed_numbers: chr  "4" "5" "6"

Question 3: How can you add 2 to each cell of the data frame?

Answer
three4five6 <- one2three4 + 2
three4five6
        Col_1_3 Col_2_4
Row_1_2       3       4
Row_3_4       5       6

Read a table as data frame

Exercise: Use the Help pane to find how to use the function read.csv().

You can find example_table.csv here (download it by clicking on the “Download raw file” button).

Use the function read.csv() to:

  1. open the file example_table.csv.
  2. this table has a header (TRUE).
  3. this table has row names in the column called “Gene_id”.

Leave all other parameters at their default values.

Save the opened table in a variable called example_table.

Answer
example_table <- read.csv(
  file="example_table.csv", 
  header=TRUE, 
  row.names="Gene_id"
)
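If you do not have the file at hand, the same call can be tried on a toy file. In the sketch below, the gene names and values are invented and written to a temporary file; only the read.csv() call mirrors the answer above.

```r
# Write a tiny CSV file with invented values (not the course dataset)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Gene_id,Sample1,Sample2",
  "GeneA,9.5,10.1",
  "GeneB,8.2,7.9"
), csv_path)

# Same arguments as in the answer: header line, row names taken from "Gene_id"
toy_table <- read.csv(file = csv_path, header = TRUE, row.names = "Gene_id")
print(toy_table)

# The column Gene_id is no longer a column: it became the row names
print(rownames(toy_table))
print(colnames(toy_table))
```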

Now let us explore this dataset.

We can find it in the Environment pane, and click on it to open a preview.

Be careful, large tables may hang your session.

Alternatively, we can use the function head() which prints the first lines of a table:

head(example_table)
        Sample1   Sample2   Sample3   Sample4
Caml   9.998194 10.004116  9.172489  9.139667
Scamp5 9.995917 10.818685 11.417558 14.907892
Dgki   9.993974 13.664396 16.132275 17.420057
Mas1   9.993956 11.370854 11.233629  9.912863
Apba1  9.992540 14.253438 14.001228 13.654701
Phkg2  9.980898  8.748654  8.714821  9.146529

The function summary() describes the dataset per sample:

summary(example_table)
    Sample1          Sample2           Sample3           Sample4       
 Min.   : 9.944   Min.   :  6.838   Min.   :  5.551   Min.   :  5.844  
 1st Qu.: 9.953   1st Qu.:  9.000   1st Qu.: 10.120   1st Qu.:  9.779  
 Median : 9.971   Median : 10.954   Median : 11.326   Median : 11.905  
 Mean   :18.937   Mean   : 19.836   Mean   : 20.828   Mean   : 21.412  
 3rd Qu.: 9.994   3rd Qu.: 12.647   3rd Qu.: 12.650   3rd Qu.: 13.968  
 Max.   :99.784   Max.   :105.077   Max.   :112.188   Max.   :111.820  

Have a look at the summary() of the dataset per gene, using the function t() to transpose:

head(t(example_table))
             Caml    Scamp5      Dgki      Mas1    Apba1    Phkg2    Timm8b
Sample1  9.998194  9.995917  9.993974  9.993956  9.99254 9.980898  99.78373
Sample2 10.004116 10.818685 13.664396 11.370854 14.25344 8.748654 105.07739
Sample3  9.172489 11.417558 16.132275 11.233629 14.00123 8.714821 112.18819
Sample4  9.139667 14.907892 17.420057  9.912863 13.65470 9.146529 109.09544
            Capn7     Yrdc    Coq10a   Gm27000    Lrrc41    Acadsb    Pdzd11
Sample1  9.976005 9.971093  9.970835  9.965511  9.960667  9.959179  9.952750
Sample2 11.314599 8.905508  8.820582  7.414795  9.961954 11.261520  9.031553
Sample3 11.452421 7.367243 10.449131  7.709008 10.435298 12.336088 10.700876
Sample4 11.692871 9.375526 10.865062 13.126211  9.137375 12.703318 10.832218
          Smarca2   Gm26079     Ptpn5    Rexo2     Ifi27   Snhg20
Sample1  9.952224  99.51466  9.947524  9.94634  9.943989 9.943724
Sample2  9.272424 103.08963 11.090058 13.36391 12.407626 6.838499
Sample3 11.194709 109.85654 11.572261 11.47744 13.591186 5.551247
Sample4 12.117571 111.82050 10.255021 12.29288 14.906542 5.843670
summary(t(example_table))
      Caml            Scamp5            Dgki             Mas1       
 Min.   : 9.140   Min.   : 9.996   Min.   : 9.994   Min.   : 9.913  
 1st Qu.: 9.164   1st Qu.:10.613   1st Qu.:12.747   1st Qu.: 9.974  
 Median : 9.585   Median :11.118   Median :14.898   Median :10.614  
 Mean   : 9.579   Mean   :11.785   Mean   :14.303   Mean   :10.628  
 3rd Qu.:10.000   3rd Qu.:12.290   3rd Qu.:16.454   3rd Qu.:11.268  
 Max.   :10.004   Max.   :14.908   Max.   :17.420   Max.   :11.371  
     Apba1            Phkg2           Timm8b           Capn7       
 Min.   : 9.993   Min.   :8.715   Min.   : 99.78   Min.   : 9.976  
 1st Qu.:12.739   1st Qu.:8.740   1st Qu.:103.75   1st Qu.:10.980  
 Median :13.828   Median :8.948   Median :107.09   Median :11.384  
 Mean   :12.975   Mean   :9.148   Mean   :106.54   Mean   :11.109  
 3rd Qu.:14.064   3rd Qu.:9.355   3rd Qu.:109.87   3rd Qu.:11.513  
 Max.   :14.253   Max.   :9.981   Max.   :112.19   Max.   :11.693  
      Yrdc           Coq10a          Gm27000           Lrrc41      
 Min.   :7.367   Min.   : 8.821   Min.   : 7.415   Min.   : 9.137  
 1st Qu.:8.521   1st Qu.: 9.683   1st Qu.: 7.635   1st Qu.: 9.755  
 Median :9.141   Median :10.210   Median : 8.837   Median : 9.961  
 Mean   :8.905   Mean   :10.026   Mean   : 9.554   Mean   : 9.874  
 3rd Qu.:9.524   3rd Qu.:10.553   3rd Qu.:10.756   3rd Qu.:10.080  
 Max.   :9.971   Max.   :10.865   Max.   :13.126   Max.   :10.435  
     Acadsb           Pdzd11          Smarca2          Gm26079      
 Min.   : 9.959   Min.   : 9.032   Min.   : 9.272   Min.   : 99.51  
 1st Qu.:10.936   1st Qu.: 9.722   1st Qu.: 9.782   1st Qu.:102.20  
 Median :11.799   Median :10.327   Median :10.573   Median :106.47  
 Mean   :11.565   Mean   :10.129   Mean   :10.634   Mean   :106.07  
 3rd Qu.:12.428   3rd Qu.:10.734   3rd Qu.:11.425   3rd Qu.:110.35  
 Max.   :12.703   Max.   :10.832   Max.   :12.118   Max.   :111.82  
     Ptpn5            Rexo2            Ifi27            Snhg20     
 Min.   : 9.948   Min.   : 9.946   Min.   : 9.944   Min.   :5.551  
 1st Qu.:10.178   1st Qu.:11.095   1st Qu.:11.792   1st Qu.:5.771  
 Median :10.673   Median :11.885   Median :12.999   Median :6.341  
 Mean   :10.716   Mean   :11.770   Mean   :12.712   Mean   :7.044  
 3rd Qu.:11.211   3rd Qu.:12.561   3rd Qu.:13.920   3rd Qu.:7.615  
 Max.   :11.572   Max.   :13.364   Max.   :14.907   Max.   :9.944  
To go further
# number of columns
ncol(example_table)
[1] 4
# number of rows
nrow(example_table)
[1] 20
# get dimensions (number of rows and number of columns)
dim(example_table)
[1] 20  4
# type of each element
str(example_table)
'data.frame':   20 obs. of  4 variables:
 $ Sample1: num  10 10 9.99 9.99 9.99 ...
 $ Sample2: num  10 10.8 13.7 11.4 14.3 ...
 $ Sample3: num  9.17 11.42 16.13 11.23 14 ...
 $ Sample4: num  9.14 14.91 17.42 9.91 13.65 ...
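When only the means are needed, the base functions colMeans() and rowMeans() give them directly, per sample and per gene. The sketch below uses a toy table with invented values so that the results are easy to check by hand.

```r
# Toy expression table with invented values
toy_expression <- data.frame(
  Sample1 = c(10, 20),
  Sample2 = c(12, 22),
  row.names = c("GeneA", "GeneB")
)

# Mean per sample (one value per column): 15 and 17
sample_means <- colMeans(toy_expression)
print(sample_means)

# Mean per gene (one value per row): 11 and 21
gene_means <- rowMeans(toy_expression)
print(gene_means)
```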

TLDR – Too Long Didn’t Read

# Declare a variable, and store a value in it:
three <- 3

# Basic operators: + - / * work as intended:
six <- 3 + 3

# Quotes are used to delimit text:
seven <- "7"

# You cannot perform maths on text:
"7" + 8 # raises an error
seven + 8 # also raises an error
six + 8 # works fine

# You can change the type of your variable with:
as.numeric("4") # the character '4' becomes the number 4
as.character(10) # the number 10 becomes the character '10'

# You can compare values with:
six < as.numeric(seven) # seven is text: convert it so that numbers are compared
six + 1 >= as.numeric(seven)
identical(example_table, mixed_data_frame)


# You can load and save a dataframe with:
read.table(file = ..., sep = ..., header = TRUE)
write.table(x = ..., file = ...)

# Create a table with:
my_table <- data.frame(...)

# Create a vector with:
my_vector <- c(...)

# You can see the first lines of a dataframe with:
head(example_table)

# Search for help in the help pane or with:
help(...)
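Since write.table() appears in this summary without an example, here is a minimal round trip: a toy table with invented values is written to a temporary tab-separated file and read back. The arguments row.names = FALSE and quote = FALSE are choices made for this sketch, not requirements.

```r
# Toy table with invented values
toy_table <- data.frame(
  Gene_id = c("GeneA", "GeneB"),
  Sample1 = c(9.5, 8.2)
)

# Save the table as a tab-separated text file
table_path <- tempfile(fileext = ".tsv")
write.table(x = toy_table, file = table_path, sep = "\t",
            row.names = FALSE, quote = FALSE)

# Load it back with the same separator
reloaded_table <- read.table(file = table_path, sep = "\t", header = TRUE)
print(reloaded_table)
```

The separator given to read.table() has to match the one used by write.table(), otherwise the columns will not be split correctly.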

R – Packages

What are modules and packages

Modules and packages are considered to be the same thing in this lesson. The difference is technical and does not relate to our session.

Most of the work you are likely to do with R will require one or several packages. A package is a list of functions or pipelines shipped under a given name. In general, a package groups together functions linked to an analysis theme or the same objective. Every single function you use through R comes from a package or another.

Read the very first line of the help pane:

help(head)

It reads: head {utils}. The function head comes from the package utils.

# Call the function "head", with the argument "example_table", and show only the first line
head(example_table, 1)
      Sample1  Sample2  Sample3  Sample4
Caml 9.998194 10.00412 9.172489 9.139667

Warning: Sometimes, two packages have a function with the same name. They most certainly do not do the same thing. IMHO, it is a good habit to always call a function with its package name to remove the ambiguity: utils::head() is better than head() alone.

# Call the function "head" ***from the package utils***, with the argument "example_table", and show only the first line
utils::head(example_table, 1)
      Sample1  Sample2  Sample3  Sample4
Caml 9.998194 10.00412 9.172489 9.139667
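To see why the package name matters, the sketch below creates a name collision on purpose by defining our own function called head(). This home-made function is purely illustrative; with it in place, head() no longer means utils::head().

```r
# Define our own head(): it now hides utils::head() in this session
head <- function(x, ...) {
  "this is not the head() you are looking for"
}

numbers <- 1:10
print(head(numbers))             # our own function answers
print(utils::head(numbers, 3))   # the function from utils: 1 2 3

# Remove our function: head() is the one from utils again
rm(head)
print(head(numbers, 3))
```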

Install a package

You may install a new package.

Use install.packages() to install a package.

# Install a package with the following function
install.packages("dplyr")

This will raise a prompt asking simple questions: where to download from (choose somewhere in France), whether to update other packages or not, etc.

Do not be afraid of the large amount of text printed in the console, and let R do the trick.

Alternatively, you can click Tools -> Install Packages in RStudio, or click on the “Install” button in the Packages tab of the Help/Plots/Files pane.

You can list installed packages with installed.packages(), and look for packages that can be updated with old.packages(). These packages can be updated with update.packages().
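As a small sketch, here is how to check from the console whether a package is installed and which version you have. It uses the base functions packageVersion() and requireNamespace(), which are not covered in this course; dplyr is only the example name used above and may or may not be installed on your machine.

```r
# installed.packages() returns a table with one row per installed package
installed_names <- rownames(installed.packages())
print("utils" %in% installed_names)  # TRUE: utils is shipped with R

# Version of an installed package
print(as.character(packageVersion("utils")))

# requireNamespace() answers TRUE or FALSE without raising an error
dplyr_available <- requireNamespace("dplyr", quietly = TRUE)
print(dplyr_available)
```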

While the function install.packages() searches for packages in the common R package list (CRAN), many bioinformatics packages are available in other package repositories. Just like the App Store and Google Play do not offer the same mobile applications, R has multiple sources for its packages. You need to know one of them, and one only: Bioconductor.

You can use Bioconductor with the function BiocManager::install():

# Install BiocManager, a package to use Bioconductor
install.packages("BiocManager")

# Install a package from Bioconductor
BiocManager::install("DESeq2")

Use a package

Installed packages are not all activated in your working session. You can load a package with the function library():

library(package="dplyr")

If the package is not installed, you will get an error.
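The sketch below shows that error being caught with tryCatch(), a base function not covered in this course. The package name is deliberately made up so that loading fails; character.only = TRUE is the argument of library() that allows the name to be given through a variable.

```r
fake_package <- "aPackageThatDoesNotExist"  # hypothetical name, chosen so that loading fails

load_result <- tryCatch(
  {
    library(fake_package, character.only = TRUE)
    "loaded"
  },
  error = function(e) "not installed"
)
print(load_result)  # "not installed"
```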

If there is no error message, then you can try:

help(topic="arrange", package="dplyr")

And search for help about how to run your command.

Alternatively, there is a more complete help page, with the function browseVignettes(). It opens your browser automatically, and if you click on “HTML”, you get some information about the package like functions, tutorials, etc.

browseVignettes(package="dplyr")

TLDR – Too Long Didn’t Read

# Install a package with the following function
install.packages("BiocManager")

# Load a package
library("BiocManager")

# Install a package from Bioconductor
BiocManager::install("DESeq2")

# Get help
browseVignettes(package="DESeq2")

Tips for your project

Write a good script

Good practice:

  • write documentation (for example, a header at the start of the script which explains the purpose of the script and the analysis steps),
  • comments (lines that are not interpreted, beginning with #),
  • code indentation (spaces at the start of code lines),
  • understandable variable names,
  • do not nest too many functions inside each other,
### difficult to understand
    print(rowMeans(data.frame(c(9, 14, 17, 9, 13),
c(11, 10, 20, 7, 17),c(15, 8,      19, 10, 15)   ))       )

### easy to understand
## Goal: this script computes the mean of the expression of our 3 samples for each gene:
#create a dataframe with the genes expression of our 3 samples:
example_data_frame <- data.frame("Expression_Sample_1" = c(9, 14, 17, 9, 13),
                                 "Expression_Sample_2" = c(11, 10, 20, 7, 17),
                                 "Expression_Sample_3" = c(15, 8, 19, 10, 15)
                      )
#add corresponding genes names into row names:
rownames(example_data_frame) <- c("Caml", "Scamp5", "Dgki", "Mas1", "Apba1")
#compute the mean of the expression for each gene:
mean_expression_Samples123 <- rowMeans(example_data_frame)
#print the result:
print(mean_expression_Samples123)
  • save your script regularly, as well as your working environment,
  • save the versions of the loaded packages at the end of your analysis (you can print loaded packages thanks to the function sessionInfo() and save the result into a file thanks to the function capture.output()).
sessionInfo() #display versions of loaded packages in the console
R version 4.3.3 (2024-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=fr_FR.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=fr_FR.UTF-8    
 [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=fr_FR.UTF-8   
 [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Paris
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] digest_0.6.35        R6_2.5.1             bookdown_0.39       
 [4] fastmap_1.1.1        xfun_0.43            cachem_1.0.8        
 [7] knitr_1.46           htmltools_0.5.8.1    rmarkdown_2.26      
[10] lifecycle_1.0.4      cli_3.6.2            rmdformatsbigr_1.0.0
[13] sass_0.4.9           jquerylib_0.1.4      compiler_4.3.3      
[16] highr_0.10           rstudioapi_0.15.0    tools_4.3.3         
[19] evaluate_0.23        bslib_0.7.0          yaml_2.3.8          
[22] jsonlite_1.8.8       rlang_1.1.3         
utils::capture.output(sessionInfo(), file="sessionInfo.txt") #save them in a file
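To check what was saved, the file can be read back with readLines(). This sketch writes to a temporary file instead of sessionInfo.txt so that nothing is left in your working directory; the variable names are illustrative.

```r
# Save the session information in a temporary file
session_file <- tempfile(fileext = ".txt")
utils::capture.output(utils::sessionInfo(), file = session_file)

# Read the file back and show its first lines
saved_lines <- readLines(session_file)
print(head(saved_lines, 3))

# The R version alone is also available as a single string
print(R.version.string)
```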

Load and save R objects

While working on your projects, you will process datasets in R. The results of these analyses will be stored in variables. This means that when you close RStudio, some of this work might be lost.

We already saw the function save.image() to save a complete copy of your working environment.

However, you can save only the content of a given variable. This is useful when you want to save the result of a function (or a pipeline) but not the whole 5 hours of work you’ve been spending on how-to-make-that-pipeline-work-correctly.

The format is called: RDS for R Data Serialization. This is done with the function saveRDS():

saveRDS(object = example_table, file = "example_table.RDS")

You can also load an RDS file into a variable. This is useful when you receive an RDS file from a coworker, or when you’d like to resume your work from a saved point. This is done with the function readRDS():

example_table <- readRDS(file = "example_table.RDS")
head(example_table)
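Here is a minimal round trip that can be run without the course dataset. The toy table holds invented values and is saved to a temporary file; identical() then confirms that the restored object is the same as the original.

```r
# Toy table with invented values
toy_table <- data.frame(
  Sample1 = c(9.5, 8.2),
  row.names = c("GeneA", "GeneB")
)

# Save one object, then load it back under another name
rds_path <- tempfile(fileext = ".RDS")
saveRDS(object = toy_table, file = rds_path)
restored_table <- readRDS(file = rds_path)

print(identical(toy_table, restored_table))  # TRUE
```

Unlike load(), readRDS() returns the object, so you choose the variable name yourself when reading it.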

Human data

Warning: If you hold human-related genomic datasets, you cannot use/upload these data anywhere. This is illegal, and doing such a thing may lead to 5 years in prison and a fine of up to 300 000€. Art. 226-16, Section 5, Code pénal.

Packages updates

It is good practice to keep package versions fixed within a work project. If you update a package (whether out of need or by choice), then you should restart your work from the beginning. This stands as long as you’re not 100% sure that the update does not affect your results.

Swirl R package

How to continue to learn R?


What is swirl?

swirl is an R package that teaches you R programming and data science interactively, at your own pace, and right in the R console.

It presents a choice of course lessons and interactively tutors a student through them. A student may be asked to watch a video, to answer a multiple-choice or fill-in-the-blanks question, or to enter a command in the R console precisely as if he or she were using R in practice. Emphasis is on the last, interacting with the R console. User responses are tested for correctness and hints are given if appropriate.

Progress is automatically saved so that a user may quit at any time and later resume without losing work.

Installation and usage

#install package
install.packages("swirl")
#load package
library(swirl)
#install the "R Programming" course
install_course("R Programming")
#start the course
swirl()

Enjoy!

Other useful swirl commands
#quit swirl
bye()
#skip a question
skip()
#return to the main menu
main()
#allow experimentation in the R console without interference from swirl
play()
#to resume interacting with swirl
nxt()
#display a help menu
info()

Conclusion

No programming language is better than any other. Anyone saying the opposite is (over)-specialized in the language they are advertising.

In the field of bioinformatics, the languages used by the community are quite limited. There are bash, R and Python. While learning bash cannot be escaped nowadays, it is not enough to perform a complete analysis with publication-ready figures and results. You should be interested in another programming language: R and/or Python. R allows you to do a lot of different analyses, and it has a large user community with lots of online help, so it’s one of the easiest languages for beginners.

Please, note that this advice is valid today, but may change. Other programming languages are used, some have lost their place on the podium, and others are trying to supersede bash, R, and Python.

Thibault: “Anyway Python is the best programming language in the WORLD. Don’t listen to Bastien.”